SPATIAL AGGREGATION OF LOCAL LIKELIHOOD ESTIMATES
WITH APPLICATIONS TO CLASSIFICATION

By Denis Belomestny and Vladimir Spokoiny 

Weierstrass Institute 

This paper presents a new method for spatially adaptive local 
(constant) likelihood estimation which applies to a broad class of 
nonparametric models, including the Gaussian, Poisson and binary 
response models. The main idea of the method is, given a sequence 
of local likelihood estimates ("weak" estimates), to construct a new 
aggregated estimate whose pointwise risk is of order of the smallest 
risk among all "weak" estimates. We also propose a new approach 
toward selecting the parameters of the procedure by providing the 
prescribed behavior of the resulting estimate in the simple paramet- 
ric situation. We establish a number of important theoretical results 
concerning the optimality of the aggregated estimate. In particular, 
our "oracle" result claims that its risk is, up to some logarithmic mul- 
tiplier, equal to the smallest risk for the given family of estimates. 
The performance of the procedure is illustrated by application to the 
classification problem. A numerical study demonstrates its reasonable 
performance in simulated and real-life examples. 

1. Introduction. This paper presents a new method of spatially adaptive
nonparametric estimation based on the aggregation of a family of local
likelihood estimates. As a main application of the method, we consider the
problem of building a classifier on the basis of a given family of k-NN or
kernel classifiers.

The local likelihood approach has been intensively discussed in recent 
years; see, for example, Tibshirani and Hastie [16], Staniswalis [15] and 
Loader [11]. We refer to Fan, Farmen and Gijbels [5] for a detailed
overview of the local maximum likelihood approach and related literature.
Similarly to nonparametric smoothing in the regression or density
framework, an important issue for local likelihood modeling is the choice 
of localization (smoothing) parameters. Different types of model selection 
techniques based on the asymptotic expansion of the local likelihood are 
mentioned in Fan, Farmen and Gijbels [5], which include global, as well as 
variable, bandwidth selection. However, the finite sample performance of 
estimators based on bandwidth or model selection is often rather unstable; 
see, for example, Breiman [2]. This point is particularly critical for the local
or pointwise model selection procedures like Lepski's method. In spite of the 
nice theoretical properties (see Lepski, Mammen and Spokoiny [8], Lepski 
and Spokoiny [9] or Spokoiny [14]), the resulting estimates suffer from high 
variability due to a pointwise model choice, especially for a large noise level. 
This suggests that in some cases, the attempt to identify the true model is 
not necessarily appropriate. One approach to reducing variability in adap- 
tive estimation is model mixing or aggregation. Catoni [4] and Yang [18], 
among others, have suggested global aggregating procedures that achieve 
the minimal estimation risks over the family of given "weak" estimates. In 
the regression setting, Juditsky and Nemirovski [7] have developed aggre- 
gation procedures which have a risk within a multiple of the smallest risk 
in the class of all convex combinations of "weak" estimates plus log(n)/n. 
Tsybakov [17] has discussed asymptotic minimax rates for aggregation. Ag- 
gregation for density estimation has been studied by Li and Barron [10] and 
more recently by Rigollet and Tsybakov [13]. To the best of our knowledge, 
pointwise aggregation has not yet been considered. 

Our approach is based on the idea of the spatial (pointwise) aggregation
of a family of local likelihood estimates ("weak" estimates) $\tilde\theta^{(k)}$. The main
idea is, given the sequence $\{\tilde\theta^{(k)}\}$, to construct in a data-driven way, for every
point $x$, the "optimal" aggregated estimate $\hat\theta(x)$. "Optimality" means that
this estimate satisfies some kind of "oracle" inequality, that is, its pointwise 
risk does not exceed the smallest pointwise risk among all "weak" estimates 
up to a logarithmic multiple. 

Our algorithm can be roughly described as follows. Let $\{\tilde\theta^{(k)}(x)\}$,
$k = 1,\dots,K$, be a sequence of "weak" local likelihood estimates at a point $x$,
ordered according to their variability, which decreases with $k$. Starting with
$\hat\theta^{(1)}(x) = \tilde\theta^{(1)}(x)$, an aggregated estimate $\hat\theta^{(k)}(x)$ at any step $1 < k \le K$
is constructed by mixing the previously constructed aggregated estimate
$\hat\theta^{(k-1)}(x)$ with the current "weak" estimate $\tilde\theta^{(k)}(x)$,

$$\hat\theta^{(k)}(x) = \gamma_k\,\tilde\theta^{(k)}(x) + (1 - \gamma_k)\,\hat\theta^{(k-1)}(x),$$

and taking $\hat\theta^{(K)}(x)$ as the final estimate. The mixing parameter $\gamma_k$ (which
may depend on the point $x$) is defined using a measure of statistical
difference between $\hat\theta^{(k-1)}(x)$ and $\tilde\theta^{(k)}(x)$. In particular, $\gamma_k$ is equal to zero if
$\hat\theta^{(k-1)}(x)$ lies outside the confidence set around $\tilde\theta^{(k)}(x)$. In view of the
sequential and pointwise nature of the algorithm, the suggested procedure is
called Spatial Stagewise Aggregation (SSA). Important features of the pro- 
posed procedure are its simplicity and applicability to a variety of problems 
including Gaussian, binary, Poisson regression, density estimation, classi- 
fication, etc. The procedure does not require any splitting of the sample 
as many other aggregation procedures do (cf. Yang [18]). Furthermore, the 
theoretical properties of SSA can be rigorously studied. In particular, we 
establish precise nonasymptotic "oracle" results which are applicable under 
very mild conditions in a rather general setting. We also show that the oracle 
property automatically implies spatial adaptivity of the proposed estimate. 

Another important feature of the procedure is that it can be easily imple- 
mented and the problem of selecting the tuning parameters can be carefully 
addressed. 

Our simulation study confirms nice finite sample performance of the pro- 
cedure for a broad class of models and problems. We only show the re- 
sults for the classification problem as the most interesting and difficult one. 
Some more examples for univariate regression and density estimation can 
be found in our preprint Belomestny and Spokoiny [1]. Section 4 shows how 
the SSA procedure can be applied to aggregating kernel and k-NN classifiers
in the classification problem. Although these two nonparametric classifiers
are rather popular, the problem of selecting the smoothing parameter (the
bandwidth for the kernel classifier or the number of neighbors for the k-NN
method) has not yet been satisfactorily addressed. Again, the SSA-based
classifier demonstrates the "oracle" quality in terms of both pointwise
and global misclassification errors. This application clearly shows one more
important feature of the SSA method: it can be applied to an arbitrary 
design with design space of arbitrary dimension. This is illustrated by sim- 
ulated and real-life classification examples in dimensions up to 10. 

The procedure proposed in this paper is limited to aggregating the kernel- 
type estimates which are based on local constant approximation. The mod- 
ern statistical literature usually considers the more general local linear (poly- 
nomial) approximation of the underlying function. However, for this paper, 
we have decided for several reasons to restrict our attention to the local 
constant case. The most important one is that for the examples and ap- 
plications we consider in this paper, the use of the local linear methods 
does not improve (and even degrades) the quality of estimation. Our expe- 
rience strongly confirms that for problems like classification, local constant 
smoothing combined with the aggregation technique delivers reasonable fi- 
nite sample quality. 

Our theoretical study is split into two major parts. Section 2 introduces 
the local parametric setting to be considered and extends the parametric 
risk bounds to the local parametric and nonparametric situation under the
so-called "small modeling bias" condition. The main result (Corollary 2.6)
claims that the parametric risk bounds continue to apply provided that this 
condition is fulfilled. One possible interpretation of our adaptive procedure 
is the search for the largest localizing scheme for which the "small mod- 
eling bias" condition still holds. Theoretical properties of the aggregation 
procedure are presented in Section 5. The main result states the "oracle" 
property of the SSA estimate: the risk of the aggregated estimate is, within a 
log-multiple, as small as the risk of the best "weak" estimate for the function 
under consideration. The results are established in the precise nonasymptotic 
way for a rather general likelihood setting under mild regularity conditions. 
Moreover, our approach establishes a link between parametric and nonpara- 
metric theory. In particular, we show that the proposed method delivers 
root-n accuracy in the parametric situation. In the nonparametric case, the 
quality corresponds to the best parametric approximation. Both the theo- 
retical study and the motivation of the procedure employ some exponential 
bounds for the likelihood which are given in Section 2.2. An important fea- 
ture of our theoretical study is that the problem of selecting the tuning 
parameters is also discussed in detail. We offer a new approach, in which 
the parameters of the procedure are selected to provide the desired
performance of the method in the simple parametric situation. This is similar to
hypothesis testing, where the critical values are selected using
the performance of the test statistic under the simple null hypothesis; see
Section 3.3.1 for a detailed explanation.

2. Local constant modeling for exponential families. This section presents 
some results on local constant likelihood estimation. We begin by describing 
the model under consideration. Suppose we are given independent random
data $Z_1,\dots,Z_n$ of the form $Z_i = (X_i, Y_i)$. Here, every $X_i$ denotes a vector
of "features" or explanatory variables which determine the distribution of
the "observation" $Y_i$. For simplicity, we assume that the $X_i$'s are valued in
the finite-dimensional Euclidean space $\mathcal{X} = \mathbb{R}^d$ and that the $Y_i$'s belong to
$\mathbb{R}$. The vector $X_i$ can be viewed as a location and $Y_i$ as the "observation at
$X_i$." Our model assumes that the distribution of each $Y_i$ is determined by a
finite-dimensional parameter $\theta$ which may depend on the location $X_i$.

More precisely, let $\mathcal{P} = (P_\theta, \theta \in \Theta)$ be a parametric family of distributions
dominated by a measure $P$. In this paper, we only consider the case when $\Theta$
is a subset of the real line. By $p(\cdot,\theta)$ we denote the corresponding density.
We consider the regression-like model in which every "response" $Y_i$ is,
conditionally on $X_i = x$, distributed with the density $p(\cdot, f(x))$ for some unknown
function $f(x)$ on $\mathcal{X}$ with values in $\Theta$. The model under consideration can
be written as

$$Y_i \sim P_{f(X_i)}.$$

The aim of the data analysis is to estimate the function $f(x)$. For related
models, see Fan and Zhang [6] and Cai, Fan and Li [3]. 

In this paper, we focus on the case where $\mathcal{P}$ is an exponential family. This
means that the density functions $p(y,\theta) = dP_\theta/dP(y)$ are of the form
$p(y,\theta) = p(y)\,e^{yC(\theta) - B(\theta)}$. Here, $C(\theta)$ and $B(\theta)$ are some given nondecreasing
functions on $\Theta$ and $p(y)$ is some nonnegative function on $\mathbb{R}$.

A natural parametrization for this family is defined by the equality
$E_\theta Y = \int y\,p(y,\theta)\,P(dy) = \theta$ for all $\theta \in \Theta$. This condition is useful because
the weighted average of observations is then a natural unbiased estimate of $\theta$.
In what follows, we assume that $\mathcal{P}$ also satisfies the following regularity conditions.

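(A1) $\mathcal{P} = (P_\theta, \theta \in \Theta \subseteq \mathbb{R})$ is an exponential family with a natural
parametrization and the functions $B(\cdot)$ and $C(\cdot)$ are continuously differentiable.

(A2) $\Theta$ is compact and convex and the Fisher information
$I(\theta) := E_\theta|\partial \log p(Y,\theta)/\partial\theta|^2$ satisfies, for some $\varkappa \ge 1$,

$$|I(\theta')/I(\theta'')|^{1/2} \le \varkappa, \qquad \theta', \theta'' \in \Theta.$$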

We illustrate this setup with two examples relevant to the applications 
we consider below. More examples can be found in Fan, Farmen and Gijbels 
[5] and Polzehl and Spokoiny [12]. 

Example 2.1 [Inhomogeneous Bernoulli (binary response) model]. Let
$Z_i = (X_i, Y_i)$ with $X_i \in \mathbb{R}^d$ and $Y_i$ a Bernoulli r.v. with parameter $f(X_i)$, that
is, $P(Y_i = 1 \mid X_i = x) = f(x)$ and $P(Y_i = 0 \mid X_i = x) = 1 - f(x)$. Such models
arise in many econometric applications and are widely used in classification
and digital imaging.

Example 2.2 (Inhomogeneous Poisson model). Suppose that every $Y_i$
is valued in the set $\mathbb{N}$ of nonnegative integers and
$P(Y_i = k \mid X_i = x) = f^k(x)\,e^{-f(x)}/k!$, that is, $Y_i$ follows a Poisson distribution
with parameter $\theta = f(x)$. This model is commonly used in queueing theory, occurs
in positron emission tomography and also serves as an approximation for the
density model obtained by a binning procedure.
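
As a worked illustration (a standard computation that the text does not spell out), both examples can be written in the exponential-family form $p(y,\theta) = p(y)\,e^{yC(\theta) - B(\theta)}$ with the natural parametrization $E_\theta Y = \theta$. The dominating measure is taken here to be the counting measure, which is an assumption of this sketch; $\mathcal{K}(\theta,\theta') = E_\theta \log\{p(Y,\theta)/p(Y,\theta')\}$ denotes the Kullback-Leibler divergence.

```latex
\begin{align*}
\text{Bernoulli:}\quad
  & p(y,\theta) = \theta^{y}(1-\theta)^{1-y},
  && C(\theta) = \log\frac{\theta}{1-\theta},
  && B(\theta) = -\log(1-\theta), \\
  & \mathcal{K}(\theta,\theta')
    = \theta\log\frac{\theta}{\theta'}
    + (1-\theta)\log\frac{1-\theta}{1-\theta'}; \\[4pt]
\text{Poisson:}\quad
  & p(y,\theta) = \frac{1}{y!}\,\theta^{y}e^{-\theta},
  && C(\theta) = \log\theta,
  && B(\theta) = \theta, \\
  & \mathcal{K}(\theta,\theta')
    = \theta\log\frac{\theta}{\theta'} - (\theta - \theta').
\end{align*}
```

In both cases $C$ and $B$ are nondecreasing on the parameter set, as the definition of the family requires.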

In the parametric setting with $f(\cdot) \equiv \theta$, the distribution of every
"observation" $Y_i$ coincides with $P_\theta$ for some $\theta \in \Theta$ and the parameter $\theta$ can be well
estimated using the parametric maximum likelihood method,

$$\tilde\theta = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log p(Y_i, \theta).$$

In the nonparametric framework, one usually applies the localization idea.
In the local constant setting this means that the regression function $f(\cdot)$
can be well approximated by a constant within some neighborhood of every
point $x$ in the "feature" space $\mathcal{X}$. This leads to the local model concentrated
in some neighborhood of the point $x$.

2.1. Localization. We use localization by weights as a general method to
describe a local model. Let, for a fixed $x$, a nonnegative weight $w_i = w_i(x) \le 1$
be assigned to the observation $Y_i$ at $X_i$, $i = 1,\dots,n$. The weights $w_i(x)$
determine a local model corresponding to the point $x$ in the sense that,
when estimating the local parameter $f(x)$, every observation $Y_i$ is taken
with weight $w_i(x)$. This leads to the local (weighted) maximum likelihood
estimate $\tilde\theta = \tilde\theta(x)$ of $f(x)$,

$$\tilde\theta(x) = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} w_i(x) \log p(Y_i, \theta). \tag{2.1}$$

We now mention two possible ways of choosing the weights $w_i(x)$.
Localization by a bandwidth is defined by weights of the form $w_i(x) = K_{\mathrm{loc}}(\mathfrak{l}_i)$ with
$\mathfrak{l}_i = \rho(x, X_i)/h$, where $h$ is a bandwidth, $\rho(x, X_i)$ is the Euclidean distance
between $x$ and the design point $X_i$ and $K_{\mathrm{loc}}$ is a location kernel. Localization
by a window simply restricts the model to a subset (window) $U = U(x)$ of
the design space which depends on $x$, that is, $w_i(x) = \mathbf{1}(X_i \in U(x))$.
Observations $Y_i$ with $X_i$ outside the region $U(x)$ are not used for estimating $f(x)$.
This kind of localization arises, for example, in classification with the
k-nearest neighbors method or in the regression tree approach. Sometimes it is
convenient to combine these two methods by defining
$w_i(x) = K_{\mathrm{loc}}(\mathfrak{l}_i)\,\mathbf{1}(X_i \in U(x))$. One example is given by the boundary-corrected kernels.
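
The two localization schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the Epanechnikov form $(1 - u^2)_+$ is used as the location kernel, and the window $U(x)$ is taken to be the set of the $k$ nearest design points.

```python
import numpy as np

def epanechnikov(u):
    """Location kernel K_loc(u) = (1 - u^2)_+ for scaled distances u >= 0."""
    return np.clip(1.0 - u**2, 0.0, None)

def bandwidth_weights(x, X, h, kernel=epanechnikov):
    """Localization by a bandwidth: w_i(x) = K_loc(rho(x, X_i) / h)."""
    dist = np.linalg.norm(X - x, axis=1)  # Euclidean distances rho(x, X_i)
    return kernel(dist / h)

def knn_window_weights(x, X, k):
    """Localization by a window: w_i(x) = 1(X_i in U(x)), U(x) = k nearest points."""
    dist = np.linalg.norm(X - x, axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dist)[:k]] = 1.0
    return w

# Toy univariate design (rows are design points X_i), point of interest x = 0.
X = np.array([[0.0], [0.1], [0.4], [0.9]])
x = np.array([0.0])
w_band = bandwidth_weights(x, X, h=0.5)
w_knn = knn_window_weights(x, X, k=2)
```

Multiplying the two weight vectors elementwise gives the combined scheme mentioned at the end of the paragraph.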

We do not assume any special structure for the weights $w_i(x)$, that is, any
configuration of weights is allowed. We also denote
$W = W(x) = \{w_1(x),\dots,w_n(x)\}$ and

$$L(W,\theta) = \sum_{i=1}^{n} w_i(x) \log p(Y_i, \theta).$$

To simplify notation, we do not show the dependence of the weights on x 
explicitly in what follows. 

2.2. Local likelihood estimation for an exponential family model. If
$\mathcal{P} = (P_\theta)$ is an exponential family with the natural parametrization, the local
log-likelihood and the local maximum likelihood estimates admit a simple
closed form representation. For a given set of weights $W = \mathrm{diag}\{w_1,\dots,w_n\}$
with $w_i \in [0,1]$, denote

$$N = \sum_{i=1}^{n} w_i, \qquad S = \sum_{i=1}^{n} w_i Y_i.$$

Note that both sums depend on the location $x$ via the weights $\{w_i\}$.

Lemma 2.1 (Polzehl and Spokoiny [12]).

$$L(W,\theta) = \sum_{i=1}^{n} w_i \log p(Y_i,\theta) = S\,C(\theta) - N\,B(\theta) + R,$$

where $R = \sum_{i=1}^{n} w_i \log p(Y_i)$. Moreover,

$$\tilde\theta = S/N = \sum_{i=1}^{n} w_i Y_i \Big/ \sum_{i=1}^{n} w_i \tag{2.2}$$

and

$$L(W,\tilde\theta,\theta) := L(W,\tilde\theta) - L(W,\theta) = N\,\mathcal{K}(\tilde\theta,\theta).$$
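
The closed forms in Lemma 2.1 are easy to check numerically. The sketch below does so for the Bernoulli family, assuming the standard Bernoulli log-likelihood and Kullback-Leibler divergence; the function names and the toy weights are illustrative only.

```python
import numpy as np

def local_mle(w, Y):
    """Closed form (2.2): theta_tilde = S / N with N = sum w_i, S = sum w_i Y_i."""
    N = np.sum(w)
    S = np.sum(w * Y)
    return S / N, N

def bernoulli_loglik(w, Y, theta):
    """Weighted log-likelihood L(W, theta) for the Bernoulli family, theta in (0, 1)."""
    return np.sum(w * (Y * np.log(theta) + (1 - Y) * np.log(1 - theta)))

def bernoulli_kl(t1, t2):
    """Kullback-Leibler divergence K(t1, t2) between Bernoulli(t1) and Bernoulli(t2)."""
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

w = np.array([1.0, 0.5, 0.5, 0.25])   # weights in [0, 1]
Y = np.array([1.0, 0.0, 1.0, 0.0])    # binary responses
theta_tilde, N = local_mle(w, Y)      # S = 1.5, N = 2.25, theta_tilde = 2/3
theta = 0.4                           # arbitrary comparison point

# Fitted log-likelihood L(W, theta_tilde, theta); the lemma says it equals N * K.
fitted = bernoulli_loglik(w, Y, theta_tilde) - bernoulli_loglik(w, Y, theta)
identity_holds = np.isclose(fitted, N * bernoulli_kl(theta_tilde, theta))
```

The weighted mean therefore needs no numerical optimization, which is what makes it cheap to compute a whole family of "weak" estimates at every point.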

We now present some exponential inequalities for the "fitted log-likelihood"
$L(W,\tilde\theta,\theta)$ which apply in the parametric situation $f(\cdot) \equiv \theta$ for an arbitrary
weighting scheme and arbitrary sample size.

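Theorem 2.2 (Polzehl and Spokoiny [12]). Let $W = \{w_i\}$ be a localizing
scheme such that $\max_i w_i \le 1$. If $f(X_i) = \theta^*$ for all $X_i$ with $w_i > 0$, then for
any $\mathfrak{z} > 0$,

$$P_{\theta^*}\bigl(L(W,\tilde\theta,\theta^*) > \mathfrak{z}\bigr) = P_{\theta^*}\bigl(N\,\mathcal{K}(\tilde\theta,\theta^*) > \mathfrak{z}\bigr) \le 2e^{-\mathfrak{z}}.$$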

Remark 2.1. Condition (A2) ensures that the Kullback-Leibler divergence
$\mathcal{K}$ satisfies $\mathcal{K}(\theta', \theta^*) \le I^*|\theta' - \theta^*|^2$ for any point $\theta'$ in a neighborhood
of $\theta^*$, where $I^*$ is the maximum of the Fisher information over this
neighborhood. Therefore, the result of Theorem 2.2 guarantees that $|\tilde\theta - \theta^*| \le C N^{-1/2}$
with high probability. Theorem 2.2 can be used to construct confidence
intervals for the parameter $\theta^*$.

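Theorem 2.3. If $\mathfrak{z}_\alpha$ satisfies $2e^{-\mathfrak{z}_\alpha} \le \alpha$, then

$$\mathcal{E}_\alpha = \{\theta' : N\,\mathcal{K}(\tilde\theta, \theta') \le \mathfrak{z}_\alpha\}$$

is an $\alpha$-confidence set for the parameter $\theta^*$.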
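
For the Bernoulli family the confidence set of Theorem 2.3 is an interval, because the Kullback-Leibler divergence is convex in its second argument. The following sketch finds its endpoints on a grid; the helper names, the grid resolution and the numerical inputs are assumptions made for illustration, and $\tilde\theta$ is assumed to lie strictly inside $(0,1)$.

```python
import numpy as np

def bernoulli_kl(t1, t2):
    """Kullback-Leibler divergence K(t1, t2) for Bernoulli parameters in (0, 1)."""
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

def kl_confidence_interval(theta_tilde, N, alpha, grid_size=10001):
    """Endpoints of {theta': N * K(theta_tilde, theta') <= z_alpha} on a grid."""
    z_alpha = np.log(2.0 / alpha)  # smallest z with 2 * exp(-z) <= alpha
    grid = np.linspace(1e-6, 1.0 - 1e-6, grid_size)
    inside = N * bernoulli_kl(theta_tilde, grid) <= z_alpha
    return grid[inside].min(), grid[inside].max(), z_alpha

lo, hi, z_alpha = kl_confidence_interval(theta_tilde=0.3, N=50.0, alpha=0.05)
```

The interval shrinks at the rate $N^{-1/2}$ as the local sample size $N$ grows, in line with Remark 2.1.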

Theorem 2.2 claims that the estimation loss measured by $\mathcal{K}(\tilde\theta, \theta^*)$ is,
with high probability, bounded by $\mathfrak{z}/N$, provided that $\mathfrak{z}$ is sufficiently large.
Similarly, one can establish a risk bound for a power loss function.

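Theorem 2.4. Assume (A1) and (A2) and let $Y_i$ be i.i.d. from $P_{\theta^*}$.
Then, for any $r > 0$,

$$E_{\theta^*} L^r(\tilde\theta,\theta^*) = N^r E_{\theta^*} \mathcal{K}^r(\tilde\theta,\theta^*) \le \mathfrak{r}_r,$$

where $\mathfrak{r}_r = 2r \int_{\mathfrak{z} \ge 0} \mathfrak{z}^{r-1} e^{-\mathfrak{z}}\,d\mathfrak{z} = 2r\,\Gamma(r)$. Moreover, for every $\lambda < 1$,

$$E_{\theta^*} \exp\{\lambda L(\tilde\theta,\theta^*)\} = E_{\theta^*} \exp\{\lambda N \mathcal{K}(\tilde\theta,\theta^*)\} \le 2(1-\lambda)^{-1}.$$

Proof. By Theorem 2.2,

$$E_{\theta^*} L^r(\tilde\theta,\theta^*) \le -\int_{\mathfrak{z} \ge 0} \mathfrak{z}^r\,dP_{\theta^*}\bigl(L(\tilde\theta,\theta^*) > \mathfrak{z}\bigr) \le r \int_{\mathfrak{z} \ge 0} \mathfrak{z}^{r-1} P_{\theta^*}\bigl(L(\tilde\theta,\theta^*) > \mathfrak{z}\bigr)\,d\mathfrak{z} \le 2r \int_{\mathfrak{z} \ge 0} \mathfrak{z}^{r-1} e^{-\mathfrak{z}}\,d\mathfrak{z}$$

and the first assertion follows. The last assertion is proved similarly. □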

2.3. Risk of estimation in the nonparametric situation. "Small modeling
bias" condition. This section extends the bound of Theorem 2.2 to the
nonparametric situation where the function $f(\cdot)$ is no longer constant, even in
the vicinity of the reference point $x$. We, however, suppose that the function
$f(\cdot)$ can be well approximated by a constant $\theta$ at all points $X_i$ with positive
weights $w_i$. To measure the quality of the approximation, define for every $\theta$

$$\Delta(W,\theta) = \sum_i \delta(\theta, f(X_i))\,\mathbf{1}(w_i > 0), \tag{2.3}$$

where, with $\ell(y,\theta,\theta') = \log\{p(y,\theta)/p(y,\theta')\}$,

$$\delta(\theta,\theta') = \log E_\theta e^{-2\ell(Y,\theta,\theta')} = \log E_\theta \frac{p^2(Y,\theta')}{p^2(Y,\theta)}.$$

One can easily check that $\delta(\theta,\theta') \le I^*|\theta - \theta'|^2$, where $I^* = \max_{\theta'' \in [\theta,\theta']} I(\theta'')$.



Theorem 2.5. Let $\mathcal{F}_W$ be the $\sigma$-field generated by the r.v. $Y_i$ for which
$w_i > 0$ and let $\Delta(W,\theta) \le \Delta$. Then, for any random variable $\xi$ measurable
with respect to $\mathcal{F}_W$,

$$E_{f(\cdot)}\,\xi \le \bigl(e^{\Delta} E_\theta \xi^2\bigr)^{1/2}.$$

Proof. Define $Z_W(\theta) = \exp\{-\sum_i \ell(Y_i,\theta,f(X_i))\,\mathbf{1}(w_i > 0)\}$. This value
is nothing but the likelihood ratio of the measure $P_{f(\cdot)}$ with respect to
$P_\theta$ upon restricting to the observations $Y_i$ for which $w_i > 0$. Then, for any
$\xi$ measurable with respect to $\mathcal{F}_W$, we have $E_{f(\cdot)}\xi = E_\theta\,\xi Z_W(\theta)$.
Independence of the $Y_i$'s implies that

$$\log E_\theta Z_W^2(\theta) = \sum_i \log E_\theta e^{-2\ell(Y_i,\theta,f(X_i))}\,\mathbf{1}(w_i > 0) = \sum_i \delta(\theta, f(X_i))\,\mathbf{1}(w_i > 0) \le \Delta.$$

The result now follows from the Cauchy-Schwarz inequality
$E_\theta\,\xi Z_W(\theta) \le \{E_\theta \xi^2\,E_\theta Z_W^2(\theta)\}^{1/2}$. □

This result implies that the bound for the risk of estimation
$E_{f(\cdot)} L^r(\tilde\theta,\theta) = N^r E_{f(\cdot)} \mathcal{K}^r(\tilde\theta,\theta)$ under the parametric hypothesis can be extended to the
nonparametric situation, provided that the value $\Delta(W,\theta)$ is sufficiently small.

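Corollary 2.6. For any $r > 0$ and any $\lambda < 1$,

$$N^r E_{f(\cdot)}|\mathcal{K}(\tilde\theta,\theta)|^r \le \sqrt{e^{\Delta(W,\theta)}\,\mathfrak{r}_{2r}},$$

$$N\bigl\{E_{f(\cdot)}|\mathcal{K}(\tilde\theta,\theta)|^r\bigr\}^{1/r} \le \lambda^{-1}\Bigl\{\log\frac{2}{1-\lambda} + \Delta(W,\theta) + 2(r-1)_+\Bigr\}.$$

Proof. The first bound follows directly from Theorems 2.4 and 2.5.
The proof of the second bound uses the fact that for $r > 0$, the function
$h(x) = \log^r(x + c_r)$ with $c_r = e^{(r-1)_+}$ is concave on $(0,\infty)$ because

$$h''(x) = \frac{r \log^{r-2}(x + c_r)}{(x + c_r)^2}\,\bigl\{r - 1 - \log(x + c_r)\bigr\} < 0$$

for $x > 0$. With $\xi = \lambda L(\tilde\theta,\theta)/2$, this implies, by monotonicity of the logarithm
function and Jensen's inequality, that $E_{f(\cdot)}\xi^r \le E_{f(\cdot)} h(e^{\xi}) \le h(E_{f(\cdot)} e^{\xi})$, hence

$$E_{f(\cdot)}^{1/r}\xi^r \le \log\bigl(E_{f(\cdot)} e^{\xi} + c_r\bigr) \le \log E_{f(\cdot)} e^{\xi} + (r-1)_+ \le \tfrac{1}{2}\log\bigl\{2e^{\Delta(W,\theta)}(1-\lambda)^{-1}\bigr\} + (r-1)_+$$

and the assertion follows. □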

Corollary 2.6 presents two bounds for the risk of estimation in the
nonparametric situation which extend the similar parametric bounds by
Theorem 2.5. The risk bound in the parametric situation can be interpreted as
the bound for the variance of the estimate $\tilde\theta$, while the term $\Delta(W,\theta)$ controls
the bias of estimation; see the next section for more details. Both bounds
formally apply, regardless of what the "modeling bias" $\Delta(W,\theta)$ is. However,
the results are meaningful only if this bias is not too large. The first bound
could be preferable for small values of $\Delta(W,\theta)$. However, the multiplicative
factor $e^{\Delta(W,\theta)/2}$ makes this bound useless for large $\Delta(W,\theta)$. The advantage
of the second bound is that the "modeling bias" enters in additive form.

In the remainder of this section, we briefly comment on relations
between the results of Section 2.3 and the usual rate results under smoothness
conditions on the function $f(\cdot)$ and the regularity conditions on the design
$X_1,\dots,X_n$. More precisely, we assume that the weights $w_i$ are supported on
a ball with radius $h > 0$ and center $x$ and that the function $f(\cdot)$ is smooth
within this ball in the sense that for $\theta^* = f(x)$,

$$\delta^{1/2}(\theta^*, f(x+t)) \le Lh \qquad \forall\,|t| \le h. \tag{2.4}$$

In view of the inequality $\delta(\theta,\theta') \le I^*|\theta - \theta'|^2$, this condition is equivalent to
the usual Lipschitz property. Obviously, (2.4) implies, with
$\bar N = \sum_i \mathbf{1}(w_i > 0)$, that

$$\Delta(W,\theta^*) \le L^2 h^2 \bar N.$$

Combined with the result of Corollary 2.6, these bounds lead to the following
rate results.

Theorem 2.7. Assume (A1) and (A2) and let $\delta^{1/2}(\theta^*, f(x+t)) \le Lh$
for all $|t| \le h$. Select $h = c(L^2 n)^{-1/(2+d)}$ for some $c > 0$ and let the localizing
scheme $W$ be such that $w_i = 0$ for all $X_i$ with $|X_i - x| > h$,
$N := \sum_i w_i \ge d_1 n h^d$ and $\bar N := \sum_i \mathbf{1}(w_i > 0) \le d_2 n h^d$ with some constants $d_1 < d_2$. Then



This corresponds to the classical accuracy of nonparametric estimation
for the Lipschitz functions (cf. Fan, Farmen and Gijbels [5]).

3. Description of the method. We start by describing the setting under
consideration. Let a point of interest $x$ be fixed. The target of estimation is
the value $f(x)$ of the regression function at $x$. The local parametric approach
described in Section 2 and based on local constant approximation of the
regression function in a vicinity of the point $x$ strongly relies on the choice
of the local neighborhood or, more generally, of the set of weights $(w_i)$.
The problem of selecting such weights and constructing an adaptive
(data-driven) estimate is one of the main issues for practical applications and we
focus on this problem in this section.

3.1. Local adaptive estimation. General setup. For a fixed $x$, we assume
that an ordered set of localizing schemes $W^{(k)} = (w_i^{(k)})$, for $k = 1,\dots,K$,
is given. The ordering condition means that $w_i^{(k)} \ge w_i^{(k')}$ for all $i$ and all
$k > k'$, that is, the degree of locality given by $W^{(k)}$ is weakened as $k$ grows.
See Section 3.3 for some examples. For the popular example of kernel weights
$w_i^{(k)} = K((X_i - x)/h_k)$, this condition means that the bandwidth $h_k$ grows
with $k$. Also, let $\{\tilde\theta^{(k)}, k = 1,\dots,K\}$ be the corresponding set of local
likelihood estimates for $\theta = f(x)$, which by (2.2) have the form

$$\tilde\theta^{(k)} = \sum_{i=1}^{n} w_i^{(k)} Y_i \Big/ \sum_{i=1}^{n} w_i^{(k)} = S_k/N_k, \qquad N_k = \sum_{i=1}^{n} w_i^{(k)}, \quad S_k = \sum_{i=1}^{n} w_i^{(k)} Y_i.$$

Due to Theorem 2.2, the value $1/N_k$ can be used to measure the variability
of the estimate $\tilde\theta^{(k)}$. The ordering condition means, in particular, that $N_k$
grows and hence the variability of $\tilde\theta^{(k)}$ decreases with $k$.

Given the estimates $\tilde\theta^{(k)}$, we consider the larger class of their convex
combinations,

$$\hat\theta = \alpha_1 \tilde\theta^{(1)} + \cdots + \alpha_K \tilde\theta^{(K)}, \qquad \alpha_1 + \cdots + \alpha_K = 1, \quad \alpha_k \ge 0,$$

where the mixing coefficients $\alpha_k$ may depend on the point $x$. We aim to
construct a new estimate $\hat\theta$ in this class which performs at least as well as
the best one in the original family $\{\tilde\theta^{(k)}\}$.

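3.2. Stagewise aggregation procedure. The adaptive estimate $\hat\theta$ of
$\theta = f(x)$ is computed sequentially via the following algorithm:

1. Initialization: $\hat\theta^{(1)} = \tilde\theta^{(1)}$.

2. Stagewise aggregation: For $k = 2,\dots,K$,

$$\hat\theta^{(k)} := \gamma_k \tilde\theta^{(k)} + (1 - \gamma_k)\hat\theta^{(k-1)},$$

with the mixing parameter $\gamma_k$ being defined for some $\mathfrak{z}_k > 0$ and a kernel
$K_{\mathrm{ag}}(\cdot)$ as

$$\gamma_k = K_{\mathrm{ag}}\bigl(m^{(k)}/\mathfrak{z}_k\bigr), \qquad m^{(k)} := N_k\,\mathcal{K}\bigl(\tilde\theta^{(k)}, \hat\theta^{(k-1)}\bigr).$$

3. Loop: If $k < K$, then increase $k$ by one and continue with Step 2.
Otherwise, terminate and set $\hat\theta = \hat\theta^{(K)}$.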

The idea behind the procedure is quite simple. We start with the first
estimate $\tilde\theta^{(1)}$, which is the most local one (it uses the smallest neighborhood)
but has the largest variability, of order $1/N_1$. Next, we consider estimates with
larger values $N_k$. Every current estimate $\tilde\theta^{(k)}$ is compared with the previously
constructed estimate $\hat\theta^{(k-1)}$. If the difference is not significant, then the new
estimate $\hat\theta^{(k)}$ basically coincides with $\tilde\theta^{(k)}$. Otherwise, the procedure essentially
keeps the previous value $\hat\theta^{(k-1)}$. For measuring the difference between the estimates
$\tilde\theta^{(k)}$ and $\hat\theta^{(k-1)}$, we use $m^{(k)} := N_k\,\mathcal{K}(\tilde\theta^{(k)}, \hat\theta^{(k-1)})$, which is motivated by
the results of Theorems 2.2 and 2.3. In particular, a large value of $m^{(k)}$
means that $\hat\theta^{(k-1)}$ does not belong to the confidence set corresponding to
$\tilde\theta^{(k)}$ and hence indicates a significant difference between these two estimates.
To quantify this significance, the procedure utilizes the parameters (critical
values) $\mathfrak{z}_k$. Their choice is discussed in Section 3.3.1.
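
The three steps of the procedure translate almost line by line into code. The following is a minimal sketch for the Bernoulli family, not the authors' implementation: the function names are hypothetical, the aggregation kernel is the piecewise linear one with $b = 1/6$ that Section 3.3 names as the default, the clipping constant `eps` is an added numerical safeguard, and the critical values in the toy example are arbitrary illustrative numbers rather than calibrated ones.

```python
import numpy as np

def bernoulli_kl(t1, t2, eps=1e-10):
    """Kullback-Leibler divergence K(t1, t2) for the Bernoulli family (clipped)."""
    t1 = np.clip(t1, eps, 1 - eps)
    t2 = np.clip(t2, eps, 1 - eps)
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

def k_ag(t, b=1.0 / 6.0):
    """Aggregation kernel K_ag(t) = (1 - (t - b)_+)_+, equal to 1 for t <= b."""
    return max(0.0, 1.0 - max(0.0, t - b))

def ssa(weak, N, z):
    """Spatial stagewise aggregation at one point x.

    weak[k], N[k]: weak estimate and local sample size of scheme k+1
                   (ordered so that N increases); z[k]: critical values,
                   z[0] is unused.
    """
    theta_hat = weak[0]                              # Step 1: initialization
    for k in range(1, len(weak)):                    # Steps 2-3: stagewise loop
        m = N[k] * bernoulli_kl(weak[k], theta_hat)  # test statistic m^(k)
        gamma = k_ag(m / z[k])                       # mixing parameter gamma_k
        theta_hat = gamma * weak[k] + (1 - gamma) * theta_hat
    return theta_hat

N = np.array([10.0, 40.0, 160.0])
z = np.array([np.nan, 2.0, 2.0])  # illustrative critical values only
# The third weak estimate differs significantly: the procedure keeps 0.52.
theta_jump = ssa(np.array([0.5, 0.52, 0.9]), N, z)
# No significant differences: the procedure follows the weak estimates to 0.51.
theta_flat = ssa(np.array([0.5, 0.52, 0.51]), N, z)
```

In the first call the statistic $m^{(3)}$ is far above the critical value, so $\gamma_3 = 0$ and the aggregate stays at the previous value; in the second call every $m^{(k)}/\mathfrak{z}_k$ is below $b$, so $\gamma_k = 1$ throughout.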

Remark 3.1. If $K_{\mathrm{ag}}(\cdot)$ is the uniform kernel on $[0,1]$, then $\gamma_k$ is either
zero or one, depending on the value of $m^{(k)}$. This yields, by induction
arguments, that the final estimate coincides with one of the "weak" estimates
$\tilde\theta^{(k)}$. In this case, our method can be considered a pointwise model selection
method.

If the kernel $K_{\mathrm{ag}}$ is such that $K_{\mathrm{ag}}(t) = 1$ for $t \le b$ with some positive
$b$, then the small values of the "test statistic" $m^{(k)}$ lead to the aggregated
estimate $\hat\theta^{(k)} = \tilde\theta^{(k)}$. This is an important feature of the procedure which
will be used in our implementation and theoretical study.

3.3. Parameter choice and implementation details. The implementation
of the SSA procedure requires the fixing of a sequence of local likelihood
estimates, the kernel $K_{\mathrm{ag}}$ and the parameters $\mathfrak{z}_k$. The next section gives
some examples of how the set of localizing schemes $W^{(k)}$ can be selected.

The only important parameters of the method are the "critical values" $\mathfrak{z}_k$ which
normalize the "test statistics" $m^{(k)}$. Section 3.3.1 describes in detail how
they can be selected in practice.

The kernel $K_{\mathrm{ag}}$ should satisfy $0 \le K_{\mathrm{ag}}(t) \le 1$, be monotone decreasing
and have support on $[0,1]$. Further, there should be a positive number $b$
such that $K_{\mathrm{ag}}(t) = 1$ for $t \le b$. Our default choice is a piecewise linear kernel
with $b = 1/6$ and $K_{\mathrm{ag}}(t) = (1 - (t - b)_+)_+$. Our numerical results (not shown
here) indicate that the particular choice of kernel $K_{\mathrm{ag}}$ has only a minor effect
on the final results.

3.3.1. Choice of the parameters $\mathfrak{z}_k$. The "critical values" $\mathfrak{z}_k$ define the
level of significance for the test statistics $m^{(k)}$. A proper choice of these
parameters is crucial for the performance of the procedure. In this section,
we propose one general approach for selecting them which is similar to the
bootstrap idea in the hypothesis testing problem. Namely, we select these
values to provide the prescribed performance of the procedure in the
parametric situation (under the null hypothesis). For every step $k$, we require
that the estimate $\hat\theta^{(k)}$ is sufficiently close to the "oracle" estimate $\tilde\theta^{(k)}$ in the
parametric situation $f(\cdot) \equiv \theta^*$, in the sense that

$$\sup_{\theta^* \in \Theta} E_{\theta^*}\bigl|N_k\,\mathcal{K}(\tilde\theta^{(k)}, \hat\theta^{(k)})\bigr|^r \le \alpha\,\mathfrak{r}_r \tag{3.1}$$

for all $k = 2,\dots,K$ with $\mathfrak{r}_r$ from Theorem 2.4. In some cases, the risk
$E_{\theta^*}|N_k\,\mathcal{K}(\tilde\theta^{(k)}, \hat\theta^{(k)})|^r$ does not depend on $\theta^*$. This is the case, for
example, when $\theta$ is a shift or scale parameter, as for Gaussian shift, exponential
and volatility families. It then suffices to check (3.1) for any single point
$\theta^*$. In the general situation, the risk $E_{\theta^*}|N_k\,\mathcal{K}(\tilde\theta^{(k)}, \hat\theta^{(k)})|^r$ depends on the
parameter value $\theta^*$. However, our numerical results (not reported here)
indicate that this dependence is minor and usually it suffices to check these
conditions for one parameter $\theta^*$. In particular, for the Bernoulli model
considered in Section 4 we recommend only checking condition (3.1) for the
"least favorable" value $\theta^* = 1/2$ corresponding to the largest variance of the
estimate $\tilde\theta$.

The values $\alpha$ and $r$ in (3.1) are two global parameters. The role of $\alpha$ is
similar to the level of the test in the hypothesis testing problem, while $r$
describes the power of the loss function. A specific choice is subjective and
depends on the particular application in question. Taking a large $r$ and small
$\alpha$ would result in an increase of the critical values and therefore improve the
performance of the method in the parametric situation, at the cost of
some loss of sensitivity to parameter changes. Theorem 5.1 presents some
upper bounds for the critical values $\mathfrak{z}_k$ as functions of $\alpha$ and $r$ in the form
$a_0 + a_1 \log \alpha^{-1} + a_2 r(K - k)$ with some coefficients $a_0$, $a_1$ and $a_2$. We see
that these bounds depend linearly on $r$ and $\log \alpha^{-1}$. For our applications
to classification, we apply a relatively small value, $r = 1/2$, because the
misclassification error corresponds to the bounded loss function. We also
apply $\alpha = 1$, although other values in the range $[0.5, 1]$ lead to very similar
results. Note that in general, the parameters $\mathfrak{z}_k$ thus defined depend on the
model considered, the design $X_1,\dots,X_n$ and the localizing schemes $W^{(1)},\dots,W^{(K)}$,
which, in turn, can differ from point to point. Therefore, an implementation
of the suggested rule would require separate computation of the parameters
for every point of estimation. However, in many situations, for example, for
the regular design, this variation from point to point is negligible and a
universal set of parameters can be used. It is only important that conditions
(3.1) are satisfied for all the points.

3.3.2. Simplified parameter choice. Proposal (3.1) is not constructive: we
have only $K - 1$ conditions for choosing $K - 1$ parameters. Here, we present
a simplified procedure which is rather easy to implement and is based on
Monte Carlo simulations. It suggests first identifying the last value $\mathfrak{z}_K$ using
the reduced aggregation procedure with only two estimates $\tilde\theta^{(K-1)}$ and $\tilde\theta^{(K)}$:

$$\sup_{\theta^* \in \Theta} E_{\theta^*}\bigl|N_K\,\mathcal{K}(\tilde\theta^{(K)}, \hat\theta(\mathfrak{z}_K))\bigr|^r \le \alpha\,\mathfrak{r}_r/(K - 1),$$

where $\hat\theta(\mathfrak{z}_K) = \gamma\,\tilde\theta^{(K)} + (1 - \gamma)\,\tilde\theta^{(K-1)}$, $\gamma = K_{\mathrm{ag}}(m/\mathfrak{z}_K)$ and
$m = N_K\,\mathcal{K}(\tilde\theta^{(K)}, \tilde\theta^{(K-1)})$. The other values $\mathfrak{z}_k$ are found in the form
$\mathfrak{z}_k = \mathfrak{z}_K + \iota(K - k)$ to ensure (3.1). This suggestion is justified by Theorem 5.1
from Section 5.1.

3.3.3. Examples of sequences of local likelihood estimates. This section 
presents some examples and recommendations for the choice of the localizing 
schemes which we also use in our simulation study. Note, however, that 
the choice of 's is not part of the SSA procedure. The procedure applies 
with any choice under some rather mild growth conditions. 

Below, we assume that the design $X_1,\dots,X_n$ is supported on the unit 
cube $[-1,1]^d$. This condition can be easily met by rescaling the design 
components. We mention two approaches to choosing the localizing scheme 
which are usually used in applications. One is based on a given sequence of 
bandwidths, and the other is based on the nearest neighbor structure of the 
design. In both situations, we assume that a location kernel 
$K_{\mathrm{loc}}$ is a nonnegative function on the unit cube $[-1,1]^d$. In 
general, we only assume that this kernel is decreasing along any radial 
line, that is, $K_{\mathrm{loc}}(\rho x) \ge K_{\mathrm{loc}}(x)$ for any 
$x\in[-1,1]^d$ and $\rho\le 1$, and $K_{\mathrm{loc}}(x) = 0$ for $|x| > 1$. In 
most applications, one applies an isotropic kernel $K_{\mathrm{loc}}$ which 
depends only on the norm of $x$. The recommended choice is the Epanechnikov 
kernel $K_{\mathrm{loc}}(x) = (1-|x|^2)_+$. 
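
For illustration, the Epanechnikov location kernel can be sketched in a few 
lines of Python. This is only a sketch: the function name 
`epanechnikov_loc` and the use of the Euclidean norm for $|x|$ are 
assumptions made here, not part of the original text. 

```python
import numpy as np

def epanechnikov_loc(x):
    """Isotropic location kernel K_loc(x) = (1 - |x|^2)_+ (hypothetical helper).

    x is an array of shape (m, d): m points in R^d.
    |x| is taken to be the Euclidean norm (an assumption of this sketch).
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    sq_norm = np.sum(x ** 2, axis=1)
    # Positive part: the kernel vanishes for |x| >= 1.
    return np.maximum(1.0 - sq_norm, 0.0)
```

The kernel equals one at the origin, decreases along every radial line and 
vanishes outside the unit ball, as required of a location kernel. 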

Bandwidth-based localizing schemes. This approach is recommended for the 
univariate or bivariate equidistant design. Let $\{h_k\}_{k=1}^K$ be a finite 
set of bandwidth candidates. We assume that this set is ordered, that is, 
$h_1 < h_2 < \dots < h_K$. Every such bandwidth determines the collection of 
kernel weights $w_i^{(k)} = K_{\mathrm{loc}}((X_i - x)/h_k)$, $i = 1,\dots,n$. 
This definition assumes that the same localizing bandwidth is applied for 
all directions in the feature space. In all of the examples below, we apply 
a geometrically increasing sequence of "bandwidths" $h_k$, that is, 
$h_{k+1} = ah_k$ for some $a > 1$. This sequence is uniquely determined by 
the starting value $h_1$, the factor $a$ and the total number $K$ of local 
schemes. The recommended choice of $a$ is $(1.25)^{1/d}$, although our 
numerical results (not reported here) indicate no significant change in the 
results when any other value of $a$ in the range 1.1 to 1.3 is used. The 
value $h_1$ is to be selected in such a way that the starting estimate 
$\tilde\theta^{(1)}$ is well defined for all points of estimation. In the 
case of a local constant approximation, this value can be taken very small 
because even one point can be sufficient for preliminary estimation. In the 
case of a regular design, the value $h_1$ is of order $n^{-1/d}$. The number 
$K$ of local schemes or, equivalently, of "weak" estimates 
$\tilde\theta^{(k)}$ is largely determined by the values $h_1$ and $a$, in 
such a way that $h_K = h_1a^{K-1}$ is approximately one, that is, the last 
estimate behaves like a global parametric estimate from the whole sample. 
The formula $K = 1 + \log_a(h_K/h_1)$ suggests that $K$ is at most 
logarithmic in the sample size $n$. 
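
The geometric bandwidth grid described above can be sketched as follows. 
The function name `bandwidth_sequence` and the stopping value `h_max = 1.0` 
(the point at which the last bandwidth covers the whole rescaled design) 
are assumptions of this sketch. 

```python
import math

def bandwidth_sequence(h1, a, h_max=1.0):
    """Geometric grid h_{k+1} = a * h_k, stopped once h_k >= h_max.

    h1, a and h_max are illustrative inputs; a > 1 is required.
    """
    if a <= 1.0 or h1 <= 0.0:
        raise ValueError("need h1 > 0 and a > 1")
    hs = [h1]
    while hs[-1] < h_max:
        hs.append(hs[-1] * a)
    return hs

# Example: d = 1, so the recommended factor is 1.25 ** (1 / d) = 1.25.
hs = bandwidth_sequence(h1=0.01, a=1.25)
K = len(hs)  # grows like log(1 / h1) / log(a)
```

With $h_1$ of order $n^{-1/d}$, the length of the grid grows only 
logarithmically in $n$, which is the point made in the text. 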

k-NN-based local schemes. If the design is irregular or the design space is 
high-dimensional ($d > 2$), then it is useful to apply the local schemes 
based on the $k$-nearest neighbor structure of the design. For this 
approach, an increasing sequence $\{N_k\}$ of integers must be fixed. For a 
fixed $x$ and every $k \ge 1$, the bandwidth $h_k$ is the minimal one for 
which the ball of radius $h_k$ contains at least $N_k$ design points. The 
weights are again defined by $w_i^{(k)} = K_{\mathrm{loc}}((X_i - x)/h_k)$. The 
sequence $\{N_k\}$ is selected similarly to the sequence $\{h_k\}$ in the 
bandwidth-based approach. One starts with a fixed $N_1$ and then multiplies 
it at every step by some factor $a > 1$: $N_{k+1} = aN_k$. The number of steps 
$K$ is such that $N_K$ is of order $n$. 
One can easily check that the kernel- and $k$-NN-based local schemes 
coincide in the case of univariate regular design. 
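
A minimal sketch of the k-NN-based scheme is given below. The helper names 
`knn_sizes` and `knn_weights`, the rounding of $aN_k$ to an integer and the 
Epanechnikov weights are assumptions of this sketch; it also assumes that 
the resulting radius is positive. 

```python
import numpy as np

def knn_sizes(n1, a, n):
    """Increasing integer sequence N_1 = n1, N_{k+1} ~ a * N_k, capped at n."""
    sizes = [n1]
    while sizes[-1] < n:
        nxt = max(sizes[-1] + 1, int(round(a * sizes[-1])))  # keep it increasing
        sizes.append(min(n, nxt))
    return sizes

def knn_weights(X, x, n_k):
    """Weights w_i = K_loc((X_i - x) / h_k) for the k-NN bandwidth h_k.

    h_k is the smallest radius whose closed ball around x holds n_k points.
    """
    dist = np.linalg.norm(np.asarray(X, dtype=float) - x, axis=1)
    h_k = np.sort(dist)[n_k - 1]
    weights = np.maximum(1.0 - (dist / h_k) ** 2, 0.0)  # Epanechnikov kernel
    return weights, h_k
```

Note that with the Epanechnikov kernel the design point lying exactly on 
the boundary of the ball receives weight zero. 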

4. Application to classification. One observes a training sample 
$(X_i, Y_i)$, $i = 1,\dots,n$, with $X_i$ valued in a Euclidean space 
$\mathcal{X} = \mathbb{R}^d$ and with known class assignment $Y_i\in\{0,1\}$. 
Our objective is to construct a discrimination rule assigning every point 
$x\in\mathcal{X}$ to one of the two classes. The classification problem can 
be naturally treated in the context of a binary response model. It is 
assumed that each observation $Y_i$ at $X_i$ is a Bernoulli r.v. with the 
parameter $\theta_i = f(X_i)$, that is, $P(Y_i = 0|X_i) = 1 - f(X_i)$ and 
$P(Y_i = 1|X_i) = f(X_i)$. The "ideal" Bayes discrimination rule is 
$\mathbf{1}(f(x)\ge 1/2)$. Since the function $f(x)$ is usually unknown, it is 
replaced by its estimate $\hat\theta$. If the distribution of $X_i$ within 
the class $k$ has density $p_k$, then 

$$\theta_i = \pi_1p_1(X_i)/(\pi_0p_0(X_i) + \pi_1p_1(X_i)),$$ 

where $\pi_k$ is the prior probability of the $k$th population, $k = 0, 1$. 

Nonparametric methods of estimating the function $\theta$ are typically 
based on local averaging. Two typical examples are given by the $k$-nearest 
neighbor ($k$-NN) estimate and the kernel estimate. For a given $k$ and every 
point $x$ in $\mathcal{X}$, denote by $\mathcal{D}_k(x)$ the subset of the 
design $X_1,\dots,X_n$ containing the $k$ nearest neighbors of $x$. Then the 
$k$-NN estimate of $f(x)$ is defined by averaging the observations $Y_i$ over 
$\mathcal{D}_k(x)$, 

$$\tilde\theta^{(k)}(x) = k^{-1}\sum_{X_i\in\mathcal{D}_k(x)} Y_i.$$ 

The definition of the kernel estimate of $f(x)$ involves a univariate kernel 
function $K(\cdot)$ and the bandwidth $h$, 

$$\tilde\theta^{(h)}(x) = \sum_{i=1}^n K\Bigl(\frac{\rho(x,X_i)}{h}\Bigr)Y_i \Big/ \sum_{i=1}^n K\Bigl(\frac{\rho(x,X_i)}{h}\Bigr),$$ 

where $\rho(x,X_i)$ denotes the distance between $x$ and $X_i$. 
Both methods require the choice of a smoothing parameter (the value $k$ for 
$k$-NN and the bandwidth $h$ for the kernel estimate). 
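
The two "weak" estimates can be sketched as follows. The function names, 
the Euclidean distance and the default Epanechnikov-type univariate kernel 
are assumptions of this sketch, not prescriptions of the method. 

```python
import numpy as np

def knn_estimate(X, Y, x, k):
    """k-NN estimate of f(x): average of Y_i over the k nearest design points."""
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Y[nearest].mean()

def kernel_estimate(X, Y, x, h, kernel=lambda t: np.maximum(1.0 - t ** 2, 0.0)):
    """Kernel estimate of f(x): weighted average with weights K(dist / h)."""
    dist = np.linalg.norm(X - x, axis=1)
    w = kernel(dist / h)
    if w.sum() == 0.0:
        return float("nan")  # no design point inside the window
    return float(np.sum(w * Y) / np.sum(w))

def classify(theta_hat):
    """Plug-in rule: assign class 1 when the estimate is at least 1/2."""
    return int(theta_hat >= 0.5)
```

Both functions depend on a single smoothing parameter (`k` or `h`), which is 
exactly the quantity the aggregation procedure is meant to choose 
adaptively. 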

Example 4.1. In this example, we consider the binary classification 
problem with the corresponding class densities $p_0(x)$ and $p_1(x)$ given by 
two-component normal mixtures 

$$p_0(x) = 0.2\,\phi(x;(-1,0),0.5I_2) + 0.8\,\phi(x;(1,0),0.5I_2),$$ 

$$p_1(x) = 0.5\,\phi(x;(0,1),0.5I_2) + 0.5\,\phi(x;(0,-1),0.5I_2),$$ 

where $\phi(\cdot;\mu,\Sigma)$ is the density of the multivariate normal 
distribution with mean vector $\mu$ and covariance matrix $\Sigma$ and $I_2$ 
is the $2\times 2$ unit matrix. 

Fig. 1. A sample from the binary response model with the normal mixture class densities 
(left) and results of applying the Bayes discrimination rule to this model (right). 
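
The model of Example 4.1 and the corresponding Bayes rule can be sketched as 
follows. Equal prior probabilities $\pi_0 = \pi_1 = 1/2$ are an assumption of 
this sketch (they match a design with the same number of observations in 
each class); the function names are illustrative. 

```python
import numpy as np

MEANS = {0: np.array([[-1.0, 0.0], [1.0, 0.0]]),
         1: np.array([[0.0, 1.0], [0.0, -1.0]])}
PROBS = {0: np.array([0.2, 0.8]), 1: np.array([0.5, 0.5])}
VAR = 0.5  # covariance matrix is 0.5 * I_2

def phi(x, mean, var=VAR):
    """Density of N(mean, var * I_d) evaluated at the rows of x."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    sq = np.sum((x - mean) ** 2, axis=1)
    return np.exp(-sq / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)

def class_density(x, label):
    """Two-component normal mixture density p_label(x)."""
    return sum(p * phi(x, m) for p, m in zip(PROBS[label], MEANS[label]))

def posterior(x, pi1=0.5):
    """f(x) = pi_1 p_1(x) / (pi_0 p_0(x) + pi_1 p_1(x)); pi1 is assumed."""
    p0, p1 = class_density(x, 0), class_density(x, 1)
    return pi1 * p1 / ((1.0 - pi1) * p0 + pi1 * p1)

def sample_class(n, label, rng):
    """Draw n points from the mixture of the given class."""
    comp = rng.choice(2, size=n, p=PROBS[label])
    return MEANS[label][comp] + np.sqrt(VAR) * rng.standard_normal((n, 2))

rng = np.random.default_rng(0)
X0, X1 = sample_class(100, 0, rng), sample_class(100, 1, rng)
bayes_labels = (posterior(np.vstack([X0, X1])) >= 0.5).astype(int)
```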



Figure 1 shows one typical realization of the training sample with 100 
observations in each class (left) and the optimal Bayes classification for a 
testing sample with 1000 observations in each class (right). First, in order 
to illustrate the "oracle" property of the SSA, we compute the pointwise 
misclassification errors for all weak estimates and SSA estimates at four 
boundary points. Figure 2 is obtained using a training sample of size 400 
and the $k$-NN weighting scheme with $N_1 = 5$, $N_K = 300$, $K = 30$ and 
$\alpha = 0.5$. Further, we have carried out 500 simulation runs, each time 
generating 100 training points and 100 testing points. The rates of 
misclassification on testing sets have been averaged thereafter to give the 
mean misclassification error, shown as a dotted reference line in Figure 3. 
We note here that the critical values 

$$\mathfrak{z}_k = 0.0031 + 0.007\,(K-k), \qquad k = 1,\dots,K,$$ 

have been computed only once for one design realization and least favorable 
parameter value $\theta^* = 0.5$, then used in all runs. The same strategy is 
used in other examples. Next, two "weak" classification methods, $k$-NN and 
kernel classifiers, with varying smoothing parameters, are applied to the 
same data set. Figure 3 (top) shows the dependence of the misclassification 
error on the bandwidth for kernel classifiers and on the number of nearest 
neighbors for the $k$-NN classifier. 
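
The linear form of the critical values used in this example is easy to 
tabulate. In the sketch below the function name `critical_values` is 
hypothetical; the numbers 0.0031, 0.007 and $K = 30$ are those quoted above. 

```python
def critical_values(z_last, iota, K):
    """Critical values z_k = z_last + iota * (K - k) for k = 1, ..., K."""
    return [z_last + iota * (K - k) for k in range(1, K + 1)]

# Values quoted in Example 4.1: last value 0.0031, slope 0.007, K = 30.
z = critical_values(0.0031, 0.007, 30)
```

The sequence decreases in $k$, so the test applied at later steps (larger 
neighborhoods) is the stricter one. 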

One can observe that a careful choice of the smoothing parameter is 
crucial for getting a reasonable quality of classification. A wrong choice 
leads to a significant increase of the misclassification rate, especially 
for the kernel classifiers. At the same time, the optimal choice can lead to 
a reasonable quality of classification, only slightly worse than that of the 
Bayes decision rule. 

Fig. 2. Pointwise misclassification errors (black dots) at four points for all weak 
estimates used in Example 4.1. Solid reference lines correspond to SSA misclassification 
errors. 



Example 4.2. We now consider Example 4.1 with eight additional 
independent $\mathcal{N}(0,1)$-distributed nuisance components. So, now 
$X_i = (X_i^1,\dots,X_i^{10})$, where 

$$(X_i^1, X_i^2) \sim p_{\mathrm{class}(i)}, \qquad (X_i^3,\dots,X_i^{10}) \sim \mathcal{N}(0_8, I_8).$$ 
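
A sketch of how the design of Example 4.2 can be generated from a 
two-dimensional sample: eight independent standard normal coordinates are 
appended to each point. The function name `add_nuisance` is hypothetical. 

```python
import numpy as np

def add_nuisance(X2, n_extra=8, rng=None):
    """Append n_extra independent N(0, 1) nuisance coordinates to each row."""
    rng = np.random.default_rng() if rng is None else rng
    X2 = np.asarray(X2, dtype=float)
    noise = rng.standard_normal((X2.shape[0], n_extra))
    return np.hstack([X2, noise])

# Example: three informative 2-d points become three 10-d points.
X10 = add_nuisance(np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]),
                   rng=np.random.default_rng(1))
```

The class label still depends only on the first two coordinates, so the 
Bayes error is unchanged while local averaging becomes harder. 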

The SSA procedure is now implemented, again using $k$-NN weights with the 
number of nearest neighbors exponentially increasing from 5 to 100. The 
results are shown in the bottom row of Figure 3. We again observe that the 
quality of both standard classifiers depends significantly on the choice of 
the smoothing parameters. In the high-dimensional situation considered, even 
under the optimal choice, the quality of the dimension-independent Bayes 
classifier is not attained. However, the SSA procedure again performs nearly 
as well as the best $k$-NN or kernel classifier. 

Fig. 3. Misclassification errors as functions of the main smoothing parameter for k-NN 
(right) and kernel (left) classifiers. SSA and Bayes misclassification errors are given as 
reference lines. Top: Example 4.1 (dimension 2). Bottom: Example 4.2 (dimension 10). 

Example 4.3 (BUPA liver disorders). We consider the dataset sampled 
by BUPA Medical Research Ltd. It consists of seven variables and 
345 observed vectors. The subjects are single male individuals. The first 
five variables are measurements taken by blood tests that are thought to 
be sensitive to liver disorders and which might arise from excessive alcohol 
consumption. The sixth variable is a sort of selector variable. The seventh 
variable is the label indicating the class identity. Among all the observations, 
there are 145 people who belong to the liver-disorder group (corresponding 
to selector number 2) and 200 people who belong to the liver-normal group. 
The BUPA liver disorder data set is notoriously difficult to classify, with 
usual error rates at about 30%. We apply SSA, $k$-NN and kernel classifiers 
to tackle this problem. In the SSA procedure, the $k$-NN weighting scheme 
was employed with the number of nearest neighbors ranging from 2 to 100. 
Figure 4 shows the corresponding leave-one-out cross-validation errors for 
the above methods. One can see that the SSA method is uniformly better than 
kernel or $k$-NN classifiers. 

Fig. 5. Left: default events (crosses indicate defaulted firms and circles indicate 
operating ones) are shown depending on the two characteristics of a firm. Right: 
leave-one-out cross-validation error for the k-NN classifier as a function of the number 
of nearest neighbors. The CV error for the SSA classifier is given as a reference line. 



Example 4.4 (Bankruptcy data). The data set from the Compustat 
repository contains statistics concerning bankruptcies (defaults) in the 
private sector of the U.S. economy during the period 2000-2005. There are 
14 explanatory variables including different financial ratios, industry 
indicators and so on. First, a preliminary analysis is conducted and the two 
most informative variables [equity/total assets ratio and net income/total 
assets ratio (profitability)] are selected. The projection of the default 
statistics onto the corresponding plane is shown in Figure 5. Further, the 
performance of the SSA procedure is compared to the performance of the 
$k$-NN classifier with different numbers of nearest neighbors. Namely, 
leave-one-out cross-validation errors are computed for both SSA and $k$-NN 
classification methods and the latter one is presented in Figure 5 as a 
function of the number of nearest neighbors. Again, as in previous examples, 
the quality of classification strongly depends on the choice of the 
parameter $k$. The adaptive SSA procedure provides the performance 
corresponding to the best possible choice of this parameter. 
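
The leave-one-out cross-validation error used in Examples 4.3 and 4.4 can 
be sketched for the $k$-NN classifier as follows. The function name 
`loo_cv_error_knn`, the Euclidean distance and the tie rule (class 1 when 
the local average is at least 1/2) are assumptions of this sketch. 

```python
import numpy as np

def loo_cv_error_knn(X, Y, k):
    """Leave-one-out misclassification rate of the k-NN plug-in classifier."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    errors = 0
    for i in range(len(Y)):
        nearest = np.argsort(D[i])[:k]
        prediction = int(Y[nearest].mean() >= 0.5)
        errors += int(prediction != Y[i])
    return errors / len(Y)
```

Evaluating this function over a grid of `k` values gives a curve of the 
kind shown in the right panel of Figure 5. 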

5. Some theoretical properties of the SSA method. This section 
discusses some important theoretical properties of the proposed aggregating 
procedure. In particular, we establish the "oracle" result which claims that 
the aggregated estimate is, up to a log factor, as good as the best one among 
the considered family $\{\tilde\theta^{(k)}\}$ of local constant estimates. 

The majority of the results in the modern statistical literature are stated 
as asymptotic rate results. It is, however, well known that the rate optimal- 
ity of an estimation procedure does not automatically imply its good finite 
sample properties and cannot be used for comparing different procedures. 
Also, the rate results are almost useless for selecting the parameters of the 
procedure. In our theoretical study, we apply another approach which aims 
to link parametric and nonparametric inference with the focus on the adap- 
tive behavior of the proposed method. This means, in particular, that the 
SSA procedure attains parametric accuracy if the parametric assumption is 
satisfied. In the general situation, the procedure attains (up to an unavoid- 
able price for adaptation) the quality corresponding to the best possible 
local parametric approximation for the underlying model near the point of 
interest. 

The "oracle" result is, in turn, a consequence of two important properties 
of the aggregated estimate $\hat\theta$: "propagation" and "stability." 
"Propagation" can be viewed as the oracle result in the parametric situation 
with $f(\cdot) = \theta^*$. In this case, the oracle choice would be the 
estimate with the largest value $N_k$, that is, the last estimate 
$\tilde\theta^{(K)}$ in the family $\{\tilde\theta^{(k)}\}$. The "propagation" 
property means that at every step $k$ of the procedure, the "aggregated" 
estimate $\hat\theta^{(k)}$ is close to the "oracle" estimate 
$\tilde\theta^{(k)}$. In other words, the "propagation" property ensures that 
at every step, the degree of locality is relaxed and the local model applied 
for estimation is extended to a larger neighborhood described by the weights 
$W^{(k)}$. The "propagation" property can be naturally extended to a nearly 
parametric case when $\Delta(W^{(k)},\theta)$ is small for some fixed $\theta$ 
and all $k \le k^*$. The "propagation" feature of the procedure ensures that 
the quality of estimation improves and confidence bounds become tighter as 
the number of iterations increases, provided that the "small modeling bias" 
condition still holds. Finally, the "stability" property ensures that the 
quality gained in the "propagation" stage will be maintained for the final 
estimate. 

Our theoretical study is carried out under assumptions (A1) and (A2) on 
the parametric family $\mathcal{P}$. Additionally, we impose the following 
assumption on the sequence of localizing schemes $W^{(k)}$ which was already 
mentioned in Section 3. 

(A3) The set $\{W^{(k)}\}$ is ordered, in the sense that 
$w_i^{(k)} \ge w_i^{(k')}$ for all $i$ and all $k > k'$. Moreover, for some 
constants $u_0, u$ with $0 < u_0 \le u < 1$, the values 
$N_k = \sum_{i=1}^n w_i^{(k)}$ satisfy, for every $2 \le k \le K$, 

$$u_0 \le N_{k-1}/N_k \le u.$$ 
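
The growth condition in (A3) is straightforward to verify numerically for 
a given sequence of local sample sizes. In this sketch the function name 
`check_growth_condition` and the example constants are hypothetical. 

```python
import numpy as np

def check_growth_condition(N, u0, u):
    """Return True if u0 <= N_{k-1} / N_k <= u for every consecutive pair."""
    N = np.asarray(N, dtype=float)
    ratios = N[:-1] / N[1:]
    return bool(np.all((ratios >= u0) & (ratios <= u)))

# A geometric sequence N_k = 5 * 1.25**k has constant ratio 1 / 1.25 = 0.8.
N = 5.0 * 1.25 ** np.arange(10)
ok = check_growth_condition(N, u0=0.7, u=0.9)
```

Geometrically growing sequences of bandwidths or neighbor counts satisfy 
the condition with $u_0$ and $u$ close to the reciprocal of the growth 
factor. 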

5.1. Behavior in the parametric situation. First, we consider the 
homogeneous situation with the constant parameter value $f(x) = \theta^*$. 
Our first result claims that in this situation, under assumption (A3), the 
parameters $\mathfrak{z}_k$ can be chosen in the form 
$\mathfrak{z}_k = \mathfrak{z}_K + \iota(K-k)$ in order to satisfy the 
"propagation" condition (3.1). The proof is given in the Appendix. 

Theorem 5.1. Assume (A1), (A2) and (A3). Let $f(X_i) = \theta^*$ for all $i$. 
Then there are three constants $a_0$, $a_1$ and $a_2$, depending on $r$, $u_0$ 
and $u$ only, such that the choice 

$$\mathfrak{z}_k = a_0 + a_1\log\alpha^{-1} + a_2r\log N_k$$ 

ensures (3.1) for all $k \le K$. In particular, 
$E_{\theta^*}|N_K\mathcal{K}(\tilde\theta^{(K)},\hat\theta)|^r \le \alpha\mathfrak{r}_r$. 

5.2. "Propagation" under "small modeling bias." We now extend the 
"propagation" result to the situation where the parametric assumption is 
no longer fulfilled, but the deviation from the parametric structure within 
the local model under consideration is sufficiently small. This deviation can 
be measured for the localizing scheme $W^{(k)}$ by $\Delta(W^{(k)},\theta)$ 
from (2.3). 

We suppose that there is a number $k^*$ such that the modeling bias 
$\Delta(W^{(k)},\theta)$ is small for some $\theta$ and all $k \le k^*$. 
Consider the corresponding estimate $\hat\theta^{(k^*)}$ obtained after the 
first $k^*$ steps of the algorithm. Theorem 2.5 implies, in this situation, 
the following result. 

Theorem 5.2. Assume (A1), (A2) and (A3). Let $\theta$ and $k^*$ be such that 
$\Delta(W^{(k)},\theta) \le \Delta$ for some $\Delta \ge 0$ and all $k \le k^*$. 
Then 

$$E_{f(\cdot)}|N_{k^*}\mathcal{K}(\tilde\theta^{(k^*)},\hat\theta^{(k^*)})|^{r/2} \le \sqrt{\alpha\mathfrak{r}_re^{\Delta}},$$ 

$$E_{f(\cdot)}|N_{k^*}\mathcal{K}(\tilde\theta^{(k^*)},\theta)|^{r/2} \le \sqrt{\mathfrak{r}_re^{\Delta}}.$$ 

5.3. "Stability after propagation" and "oracle" results. Due to the 
"propagation" result, the procedure performs well, provided the "small 
modeling bias" condition $\Delta(W^{(k)},\theta) \le \Delta$ is satisfied. To 
establish the accuracy result for the final estimate $\hat\theta$, we have to 
check that the aggregated estimate $\hat\theta^{(k)}$ does not vary much at 
the steps "after propagation" when the divergence $\Delta(W^{(k)},\theta)$ 
from the parametric model becomes large. 

Theorem 5.3. Under (A1), (A2) and (A3), for every $k \le K$, we have 

(5.1) $N_k\mathcal{K}(\hat\theta^{(k)},\hat\theta^{(k-1)}) \le \mathfrak{z}_k.$ 

Moreover, under (A3), for every $k'$ with $k < k' \le K$, we have 

(5.2) $N_k\mathcal{K}(\hat\theta^{(k')},\hat\theta^{(k)}) \le \varkappa^2c_u^2\bar{\mathfrak{z}}_k$ 

with $c_u = (u^{-1/2} - 1)^{-1}$ and $\bar{\mathfrak{z}}_k = \max_{l\ge k}\mathfrak{z}_l$. 

Remark 5.1. An interesting feature of this result is that it is satisfied 
with probability one, that is, the control of stability not only "works" with 
high probability, it always applies. This property follows directly from the 
construction of the procedure. 
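
The deterministic nature of the stability bound can be checked on a small 
sketch of the aggregation step for the Bernoulli (binary response) model. 
The update rule (statistic $m^{(k)}$, coefficient $\gamma_k$, convex 
combination) follows the description of the procedure used in the proofs; 
the piecewise linear shape of the aggregation kernel between $b$ and 1, the 
value $b = 0.5$, the clipping constant and all function names are 
assumptions of this sketch. 

```python
import numpy as np

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def k_ag(t, b=0.5):
    """Aggregation kernel: 1 for t <= b, 0 for t >= 1, linear in between
    (the linear part is an assumption of this sketch)."""
    if t <= b:
        return 1.0
    if t >= 1.0:
        return 0.0
    return (1.0 - t) / (1.0 - b)

def ssa_aggregate(weak, N, z, b=0.5, eps=1e-6):
    """Sequentially aggregate weak estimates weak[0], ..., weak[K-1].

    N[k] are the local sample sizes and z[k] the critical values.
    Weak estimates are clipped to [eps, 1 - eps] to keep the divergence finite.
    """
    weak = np.clip(np.asarray(weak, dtype=float), eps, 1.0 - eps)
    agg = [float(weak[0])]
    for k in range(1, len(weak)):
        m = N[k] * kl_bernoulli(weak[k], agg[-1])     # test statistic m^(k)
        gamma = k_ag(m / z[k], b)                      # mixing coefficient
        agg.append(float(gamma * weak[k] + (1.0 - gamma) * agg[-1]))
    return agg
```

Because the divergence is convex in its first argument and $\gamma_k$ 
vanishes whenever $m^{(k)} \ge \mathfrak{z}_k$, each step of this sketch 
moves the aggregated estimate by at most $\mathfrak{z}_k/N_k$ in 
Kullback-Leibler divergence, whatever the data. 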

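Proof of Theorem 5.3. The convexity of the Kullback-Leibler divergence 
$\mathcal{K}(u,v)$ with respect to the first argument implies 

$$\mathcal{K}(\hat\theta^{(k)},\hat\theta^{(k-1)}) \le \gamma_k\mathcal{K}(\tilde\theta^{(k)},\hat\theta^{(k-1)}).$$ 

If $\mathcal{K}(\tilde\theta^{(k)},\hat\theta^{(k-1)}) \ge \mathfrak{z}_k/N_k$, 
then $\gamma_k = 0$ and (5.1) follows. Now, assumption (A2) and Lemma A.1 
yield 

$$\mathcal{K}^{1/2}(\hat\theta^{(k')},\hat\theta^{(k)}) \le \varkappa\sum_{l=k+1}^{k'}\mathcal{K}^{1/2}(\hat\theta^{(l)},\hat\theta^{(l-1)}) \le \varkappa\sum_{l=k+1}^{k'}(\mathfrak{z}_l/N_l)^{1/2}.$$ 

The use of assumption (A3) leads to the bound 

$$\mathcal{K}^{1/2}(\hat\theta^{(k')},\hat\theta^{(k)}) \le \varkappa(\bar{\mathfrak{z}}_k/N_k)^{1/2}\sum_{l=k+1}^{k'}u^{(l-k)/2} \le \varkappa(u^{-1/2}-1)^{-1}(\bar{\mathfrak{z}}_k/N_k)^{1/2},$$ 

which proves (5.2). □ 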
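
The constant $c_u$ is simply the sum of the geometric series 
$\sum_{j\ge 1}u^{j/2}$ that appears in the last step of the proof. A short 
numerical check (function names are illustrative): 

```python
def c_u(u):
    """Constant c_u = (u**(-1/2) - 1)**(-1) from Theorem 5.3, for 0 < u < 1."""
    return 1.0 / (u ** -0.5 - 1.0)

def geometric_tail(u, terms=10000):
    """Partial sum of sum_{j >= 1} u**(j / 2), which converges to c_u."""
    return sum(u ** (j / 2.0) for j in range(1, terms + 1))

# For u = 0.8 (growth factor 1.25) the constant is roughly 8.5.
value = c_u(0.8)
```

The constant grows quickly as $u$ approaches one, that is, when consecutive 
local sample sizes are close to each other. 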

A combination of the "propagation" and "stability" statements implies 
the main result concerning the properties of the adaptive estimate 
$\hat\theta$. 

Theorem 5.4. Assume (A1), (A2) and (A3). Let $k^*$ be a "good" choice, 
in the sense that 

$$\max_{k\le k^*}\Delta(W^{(k)},\theta) \le \Delta$$ 

for some $\theta$ and some value $\Delta$. Then 

$$E_{f(\cdot)}|N_{k^*}\mathcal{K}(\tilde\theta^{(k^*)},\hat\theta)|^{r/2} \le 2^{(r-1)_+}\varkappa^r\{\sqrt{\alpha\mathfrak{r}_re^{\Delta}} + (c_u^2\bar{\mathfrak{z}}_{k^*})^{r/2}\},$$ 

where $c_u$ is the constant from Theorem 5.3. 

We also present a corollary of the "oracle" result concerning the risk of 
the adaptive estimate for the special case where $r = 1$. Other values of $r$ 
can be considered as well: one only has to update the constants depending 
on $r$. We also assume that $\alpha \le 1$. 

Corollary 5.5. Let $\max_{k\le k^*}\Delta(W^{(k)},\theta) \le \Delta$ for some 
$\theta$ and some $\Delta$. 
Then 

Proof. Simply observe that by Lemma A.1, 

$$\mathcal{K}^{1/2}(\hat\theta,\theta) \le \varkappa\Bigl\{\mathcal{K}^{1/2}(\tilde\theta^{(k^*)},\theta) + \mathcal{K}^{1/2}(\tilde\theta^{(k^*)},\hat\theta^{(k^*)}) + \sum_{l=k^*+1}^{K}\mathcal{K}^{1/2}(\hat\theta^{(l)},\hat\theta^{(l-1)})\Bigr\}$$ 

and follow the proof of Theorem 5.3. □ 

Remark 5.2. Recall that in the parametric situation, the risk 
$E_{\theta^*}|N_{k^*}\mathcal{K}(\tilde\theta^{(k^*)},\theta^*)|^{1/2}$ of 
$\tilde\theta^{(k^*)}$ is bounded by $\mathfrak{r}_{1/2}$ (cf. Theorem 2.2). In 
the nonparametric situation, the result is only slightly worse: the value 
$\mathfrak{r}_{1/2}$ is replaced by $\sqrt{\mathfrak{r}_1e^{\Delta}}$, which takes 
into account the modeling bias. There is also an additional term 
proportional to $\sqrt{\bar{\mathfrak{z}}_{k^*}}$, which can be considered as 
the payment for adaptation. Due to Theorem 5.1, $\bar{\mathfrak{z}}_{k^*}$ is 
bounded from above by $\mathfrak{z}_K + \iota(K-k^*)$. By Theorem 5.1, $K$ is 
only logarithmic in the sample size $n$. 

Therefore, the risk of the aggregated estimate corresponds to the best 
possible risk among the family $\{\tilde\theta^{(k)}\}$ for the choice 
$k = k^*$ up to a logarithmic factor. Lepski, Mammen and Spokoiny [8] 
established a similar result in the regression setting for the pointwise 
adaptive Lepski procedure. Combining the result of Corollary 5.5 with 
Theorem 2.7 yields the rate of adaptive estimation 
$(n^{-1}\log n)^{1/(2+d)}$ under Lipschitz smoothness of the function $f$ and 
the usual design regularity; see Polzehl and Spokoiny [12] for more details. 
It was shown that in the problem of pointwise adaptive estimation, this rate 
is optimal and cannot be improved by any estimation method. This gives 
an indirect proof of the optimality of our procedure: the factor 
$\bar{\mathfrak{z}}_{k^*}$ in the accuracy of estimation cannot be removed or 
reduced in the rate because otherwise a similar improvement would appear in 
the rate of estimation. 

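APPENDIX: PROOF OF THEOREM 5.1 

The proof utilizes the following simple "metric-like" property of 
$\mathcal{K}^{1/2}(\cdot,\cdot)$. 

Lemma A.1 (Polzehl and Spokoiny [12], Lemma 5.2). Under condition 
(A2), it holds for every sequence $\theta_0,\theta_1,\dots,\theta_m$ that 

$$\mathcal{K}^{1/2}(\theta_1,\theta_2) \le \varkappa\{\mathcal{K}^{1/2}(\theta_1,\theta_0) + \mathcal{K}^{1/2}(\theta_2,\theta_0)\},$$ 

$$\mathcal{K}^{1/2}(\theta_0,\theta_m) \le \varkappa\{\mathcal{K}^{1/2}(\theta_0,\theta_1) + \dots + \mathcal{K}^{1/2}(\theta_{m-1},\theta_m)\}.$$ 

With the given constants $\mathfrak{z}_k$, define for $k > 1$ the random sets 

$$\mathcal{A}_k = \{N_k\mathcal{K}(\tilde\theta^{(k)},\tilde\theta^{(k-1)}) \le b\mathfrak{z}_k\}, \qquad \mathcal{A}^{(k)} = \mathcal{A}_2\cap\dots\cap\mathcal{A}_k,$$ 

where $b$ enters into the construction of $K_{\mathrm{ag}}$: 
$K_{\mathrm{ag}}(t) = 1$ for $t \le b$. 

First note that $\hat\theta^{(k)} = \tilde\theta^{(k)}$ on $\mathcal{A}^{(k)}$ for 
all $k \le K$. This fact can be proved by induction on $k$. For $k = 1$ the 
assertion is trivial because $\hat\theta^{(1)} = \tilde\theta^{(1)}$. Now 
suppose that $\hat\theta^{(k-1)} = \tilde\theta^{(k-1)}$. It then holds on 
$\mathcal{A}_k$ that 
$m^{(k)} = N_k\mathcal{K}(\tilde\theta^{(k)},\hat\theta^{(k-1)}) = N_k\mathcal{K}(\tilde\theta^{(k)},\tilde\theta^{(k-1)}) \le b\mathfrak{z}_k$ 
and thus $\gamma_k = K_{\mathrm{ag}}(m^{(k)}/\mathfrak{z}_k) \ge K_{\mathrm{ag}}(b) = 1$, 
yielding $\hat\theta^{(k)} = \tilde\theta^{(k)}$. 

Therefore, it remains to bound the risk of $\hat\theta^{(k)}$ on the 
complement $\bar{\mathcal{A}}^{(k)}$ of $\mathcal{A}^{(k)}$. Define 
$\mathcal{B}_k = \mathcal{A}^{(k-1)}\setminus\mathcal{A}^{(k)}$. On the event 
$\mathcal{B}_k$, the index $k$ is the first one for which the condition 
$N_k\mathcal{K}(\tilde\theta^{(k)},\tilde\theta^{(k-1)}) \le b\mathfrak{z}_k$ is 
violated. It is obvious that 
$\bar{\mathcal{A}}^{(k)} = \bigcup_{l\le k}\mathcal{B}_l$. First, we bound the 
probability $P_{\theta^*}(\mathcal{B}_l)$. Applying assumption (A3) and 
Lemma A.1 yields, for every $l$, 

$$N_l\mathcal{K}(\tilde\theta^{(l)},\tilde\theta^{(l-1)}) \le 2\varkappa^2N_l\{\mathcal{K}(\tilde\theta^{(l)},\theta^*) + \mathcal{K}(\tilde\theta^{(l-1)},\theta^*)\}$$ 

$$\le 2\varkappa^2\{N_l\mathcal{K}(\tilde\theta^{(l)},\theta^*) + u_0^{-1}N_{l-1}\mathcal{K}(\tilde\theta^{(l-1)},\theta^*)\}.$$ 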

Therefore, by Theorem 2.2, 

P*.(# { ) < P^^JT^),^- 1 )) > b h ) < 2exp(--^A 

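On the set $\mathcal{B}_l$ we have $\hat\theta^{(l-1)} = \tilde\theta^{(l-1)}$ 
and thus for every $k > l$, the aggregated estimate $\hat\theta^{(k)}$ is, by 
construction, a convex combination of 
$\tilde\theta^{(l-1)},\dots,\tilde\theta^{(k)}$. 
Convexity of the Kullback-Leibler divergence with respect to the second 
argument, the definition of $\hat\theta^{(k)}$ and Lemma A.1 ensure that 

$$\mathcal{K}^{1/2}(\tilde\theta^{(k)},\hat\theta^{(k)})\mathbf{1}(\mathcal{B}_l) \le \max_{l'=l-1,\dots,k-1}\mathcal{K}^{1/2}(\tilde\theta^{(k)},\tilde\theta^{(l')})$$ 

$$\le \varkappa\max_{l'=l-1,\dots,k-1}\{\mathcal{K}^{1/2}(\tilde\theta^{(k)},\theta^*) + \mathcal{K}^{1/2}(\tilde\theta^{(l')},\theta^*)\}$$ 

$$\le 2\varkappa\max_{l'=l-1,\dots,k}\mathcal{K}^{1/2}(\tilde\theta^{(l')},\theta^*).$$ 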

This and Theorem 2.4 imply that for every r, 

k 

i'=i-i 

k 

<{2x) 2r ]T E^ 2 JT 2r (0(O ; 0*) P V 2 (^) 
l'=l-i 

<(2x) 2 V 2 f £ iV^expf-f^ 

<C 1 iVf r r 2 1 r /2 exp(-c 23/ ) 
for some fixed constants $C_1$ and $c_2$. Therefore, 

E e ,^(0( fc ),0( fc )) <J2^9*^ r (0 {k \O {k) nm < EC 1 7Vr r -r 2 1 r /2 exp(- C23/ ). 

It remains to check that the choice 
$\mathfrak{z}_k = a_0 + a_1\log\alpha^{-1} + a_2\log(N_K/N_k)$, with properly 
selected $a_0$, $a_1$ and $a_2$, provides the required bound 
$E_{\theta^*}|N_k\mathcal{K}(\tilde\theta^{(k)},\hat\theta^{(k)})|^r \le \alpha\mathfrak{r}_r$. 

REFERENCES 

[1] Belomestny, D. and Spokoiny, V. (2006). Spatial aggregation of local likelihood 
estimates with applications to classification. SFB 649 Discussion Paper 2006-036. 

[2] Breiman, L. (1996). Stacked regressions. Machine Learning 24 49-64. 

[3] Cai, Z., Fan, J. and Li, R. (2000). Efficient estimation and inference for 
varying-coefficient models. J. Amer. Statist. Assoc. 95 888-902. MR1804446 

[4] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture 
Notes in Math. 1851. Springer, Berlin. MR2163920 

[5] Fan, J., Farmen, M. and Gijbels, I. (1998). Local maximum likelihood estimation 
and inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 591-608. MR1626013 

[6] Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. 
Ann. Statist. 27 1491-1518. MR1742497 






[7] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric 
estimation. Ann. Statist. 28 681-712. MR1792783 

[8] Lepski, O., Mammen, E. and Spokoiny, V. (1997). Optimal spatial adaptation 
to inhomogeneous smoothness: An approach based on kernel estimates with 
variable bandwidth selectors. Ann. Statist. 25 929-947. MR1447734 

[9] Lepski, O. and Spokoiny, V. (1997). Optimal pointwise adaptive methods in 
nonparametric estimation. Ann. Statist. 25 2512-2546. MR1604408 

[10] Li, J. and Barron, A. (1999). Mixture density estimation. In Advances in Neural 
Information Processing Systems 12 (S. A. Solla, T. K. Leen and K.-R. Müller, 
eds.). Morgan Kaufmann Publishers, San Mateo, CA. 

[11] Loader, C. R. (1996). Local likelihood density estimation. Ann. Statist. 24 
1602-1618. MR1416652 

[12] Polzehl, J. and Spokoiny, V. (2006). Propagation-separation approach for local 
likelihood estimation. Probab. Theory Related Fields 135 335-362. MR2240690 

[13] Rigollet, Ph. and Tsybakov, A. (2005). Linear and convex aggregation of density 
estimators. Manuscript. 

[14] Spokoiny, V. (1998). Estimation of a function with discontinuities via local 
polynomial fit with an adaptive window choice. Ann. Statist. 26 1356-1378. MR1647669 

[15] Staniswalis, J. G. (1989). The kernel estimate of a regression function in 
likelihood-based models. J. Amer. Statist. Assoc. 84 276-283. MR0999689 

[16] Tibshirani, R. and Hastie, T. J. (1987). Local likelihood estimation. J. Amer. 
Statist. Assoc. 82 559-567. MR0898359 

[17] Tsybakov, A. (2003). Optimal rates of aggregation. In Computational Learning 
Theory and Kernel Machines (B. Schölkopf and M. Warmuth, eds.) 303-313. 
Lecture Notes in Artificial Intelligence 2777. Springer, Heidelberg. 

[18] Yang, Y. (2004). Aggregating regression procedures to improve performance. 
Bernoulli 10 25-47. MR2044592 



Weierstrass Institute 
Mohrenstrasse 39 
10117 Berlin 
Germany 

E-mail: belomest@wias-berlin.de 
spokoiny@wias-berlin.de 



